Exchange Server 2010: Achieving High Availability

12/15/2010 11:35:01 AM

A number of barriers stand in the way of achieving high availability. For example, in a poor implementation, Exchange might be installed on improperly sized servers without following best practices. An Exchange messaging environment can be deployed this way in a short time, but many important details are likely to be missed and availability will suffer.

By contrast, in a high-availability environment the messaging system deployment is well designed. During the deployment project, organizational messaging requirements are researched. The current messaging environment is examined for inadequacies and solutions are identified. Research into how best to deploy Exchange may go on for an extended period while consultants are brought in to help build a design. Vendors are also brought in to discuss how their products will work and how they can contribute to running a highly available system. Hardware is sized and tested to meet both business and technical requirements, such as service-level agreements (SLAs), recovery point objectives, and cost considerations. The hardware chosen provides the defined level of fault tolerance through redundant components such as memory, drives, network connections, cooling fans, and power supplies.

A high-availability environment also incorporates a significant amount of design, planning, and testing. It will often, but not always, include additional features, such as failover clustering and load balancing, which are designed to decrease downtime by enabling rolling upgrades and allowing for a preplanned response to failures. The messaging client software and its potential configurations can also improve availability. For example, Outlook 2003 and later offers the Cached Exchange Mode configuration, which allows users to create new messages, respond to existing mail in their Inboxes, and manage their calendars (among many other tasks) even if the connection to the Exchange server is lost. Users can therefore continue working locally even though the Exchange server might be down for a short time. When the connection to the Exchange server is restored, any changes made are synchronized. In the end, all critical business systems must be analyzed to understand the cost incurred when they are unavailable. If downtime has a significant cost, the organization should take steps to minimize downtime. This is particularly true if the cost of downtime is greater than the cost of deploying a suitable highly available solution.

The opposite of availability is downtime, both planned and unplanned. Planned downtime is the result of scheduled events, such as maintenance. Unplanned downtime is the result of unscheduled events. Events that cause unplanned downtime can be minor, such as a faulty hardware driver or a processor failure, or major, such as an earthquake, fire, or flood.

1. Measuring Availability

Availability is usually expressed as the percentage of time that a service is available. As an example, a requirement for 99.9 percent availability over a one-year period of 24-hour days, 7 days a week (8,760 hours) allows for only about 8.76 hours of downtime, as shown in Table 1. In complex environments, organizations specify availability targets for each service. When dealing with an Exchange messaging environment, availability goals may be tied to specific features such as Microsoft Outlook Web App, Simple Mail Transfer Protocol (SMTP) message delivery, and Outlook MAPI connectivity. These availability targets are then turned into SLAs that hold the group operating the messaging system accountable for meeting those targets. In some cases, if those targets are not met, the salaries and bonuses of the employees and managers in the responsible group can be affected. In some environments both planned and unplanned downtime count against the overall availability target; in others planned downtime is exempt from the target. Because successfully achieving high availability includes update management to mitigate potential downtime, some planned downtime is required.

Table 1. Permitted Downtime for Specific Availability Targets
AVAILABILITY TARGET	PERMITTED DOWNTIME ANNUALLY
99 percent	87 hours, 36 minutes
99.9 percent	8 hours, 46 minutes
99.99 percent	52 minutes, 34 seconds
99.999 percent	5 minutes, 15 seconds
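
The figures in Table 1 follow from simple arithmetic, which the following minimal Python sketch reproduces. It is an illustration only, not part of any Exchange tooling: the function names `permitted_downtime` and `measured_availability` are hypothetical, and a non-leap year of 8,760 hours is assumed. The second function shows how the result changes depending on whether planned downtime counts against the target.

```python
from datetime import timedelta

# Assumption: a non-leap year of 24-hour days, 7 days a week.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def permitted_downtime(availability_percent: float) -> timedelta:
    """Return the annual downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    seconds = round(unavailable_fraction * HOURS_PER_YEAR * 3600)
    return timedelta(seconds=seconds)


def measured_availability(
    unplanned: timedelta,
    planned: timedelta = timedelta(0),
    count_planned: bool = True,
) -> float:
    """Return the availability (in percent) achieved over one year.

    Set count_planned=False to model an SLA in which planned
    downtime is exempt from the availability target.
    """
    downtime = unplanned + (planned if count_planned else timedelta(0))
    return 100 * (1 - downtime / timedelta(hours=HOURS_PER_YEAR))


if __name__ == "__main__":
    # Recompute the rows of Table 1.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target} percent -> {permitted_downtime(target)}")

    # Hypothetical year: 4 hours unplanned, 6 hours planned downtime.
    unplanned, planned = timedelta(hours=4), timedelta(hours=6)
    print(measured_availability(unplanned, planned, count_planned=True))
    print(measured_availability(unplanned, planned, count_planned=False))
```

With the hypothetical figures in the last example, the same year meets a 99.9 percent target when planned downtime is exempt and misses it when planned downtime is counted, which is why the SLA should state which rule applies.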

This bit of background should not detract from the great features provided to help achieve high availability in Exchange 2010; rather, the purpose is to provide a frame of reference as the Exchange-specific high-availability features are discussed.

2. Exchange 2010 High-Availability Features

Exchange 2010 builds on the solid foundation set by Exchange 2007 with regard to high availability. Exchange 2007 introduced a number of new options for availability, including cluster continuous replication (CCR), standby continuous replication (SCR), single copy cluster (SCC), and local continuous replication (LCR). Exchange 2010 introduces the Database Availability Group (DAG), which combines the best functionality available in Exchange 2007. A DAG is a group of up to 16 Exchange 2010 Mailbox servers that can each maintain up to 100 databases. Each database may have up to 16 copies, which are kept current using continuous replication.

The DAG differs from the high-availability options in Exchange Server 2007 SP1 in the following ways:

With CCR, there can be only two highly available copies of the database within the cluster; within the DAG there can be up to 16 copies of each database.
With SCR, the activation process required administrative intervention; within a DAG, failover between individual database copies can happen automatically.
With SCC, a single shared copy of the database consumes less storage but provides no redundancy. Exchange Server 2010 has no configuration that replaces this functionality, although some third-party solutions may be able to provide similar functionality by using the Third Party Replication API.
With LCR, a single-server configuration allows two copies of a database to reside on different storage connected to the same server. No configuration in Exchange Server 2010 replaces this functionality.

Exchange 2010 provides database-level failover within the DAG. A single database failure no longer affects all mailbox databases on a server. Database failover time has also been improved since Exchange 2007. The DAG also makes it easier to implement site failover because now the DAG handles both in-site and inter-site replication.

Exchange 2010 also has improved non-mailbox high availability. Transport servers now have a feature called shadow redundancy, which provides redundancy for in-transit messages.

Another improvement is online mailbox moves. In previous versions of Exchange, mailboxes were moved offline, which required users to disconnect their clients until the move completed. Because this process affects users, such mailbox moves are usually scheduled during maintenance windows. During a migration project, being able to move mailboxes only at night and on weekends often does not provide enough time to complete the migration. The online mailbox move feature allows mailboxes to be moved between databases asynchronously without taking the user offline. Users can maintain their connection and keep working while their e-mail is moved in the background. This reduces end-user downtime, allows mailbox migrations to be performed during business hours, and so improves availability for end users. More information about Exchange 2010 high-availability planning can be found in the Planning for High Availability and Site Resilience topic at http://technet.microsoft.com/en-us/library/dd638104.aspx.